Abstract


We consider codes over fixed alphabets against worst-case symbol deletions. For any fixed $k \ge 2$,
we construct a family of codes over an alphabet of size $k$ with positive rate, which allow efficient recovery
from a worst-case deletion fraction approaching $1 - \frac{2}{k+\sqrt{k}}$. In particular, for binary codes, we are able
to recover a fraction of deletions approaching $1/(\sqrt{2}+1) = \sqrt{2}-1 \approx 0.414$. Previously, even
non-constructively, the largest deletion fraction known to be correctable with positive rate was $1 - \Theta(1/\sqrt{k})$,
and around 0.17 for the binary case.


Our result pins down the largest fraction of correctable deletions for $k$-ary codes as $1 - \Theta(1/k)$, since
$1 - 1/k$ is an upper bound even for the simpler model of erasures where the locations of the missing
symbols are known.


Closing the gap between $\sqrt{2}-1$ and $1/2$ for the limit of worst-case deletions correctable by binary
codes remains a tantalizing open question.


1 Introduction 

This work concerns error-correcting codes capable of correcting worst-case deletions. Specifically, consider
a fixed alphabet $[k] = \{1,2,\dots,k\}$, and suppose we transmit a sequence of $n$ symbols from $[k]$ over a channel
that can adversarially delete an arbitrary fraction $p$ of symbols, resulting in a subsequence of length $(1-p)n$
being received at the other end. The locations of the deleted symbols are unknown to the receiver. The goal
is to design a code $C \subseteq [k]^n$ such that every $c \in C$ can be uniquely recovered from any of its subsequences
caused by up to $pn$ deletions. Equivalently, for $c \ne \tilde{c} \in C$, the length of the longest common subsequence of
$c, \tilde{c}$, which we denote by $\mathrm{LCS}(c,\tilde{c})$, must be less than $(1-p)n$.

In this work, we are interested in the question of correcting as large a fraction $p$ of deletions as possible
with codes of positive rate (bounded away from 0 for $n \to \infty$). That is, we would like $|C| \ge \exp(\Omega_k(n))$ so
that the code incurs only a constant factor redundancy (this factor could depend on $k$, which we think of as
fixed).

* A preliminary conference version of this paper [2], with a weaker bound of $1 - 2/(k+1)$ on the fraction of correctable deletions,
was presented at the 2016 ACM-SIAM Symposium on Discrete Algorithms (SODA) in January 2016.
Denote by $p^*(k)$ the limit superior of all $p \in [0,1]$ such that there is a positive rate code family over
alphabet $[k]$ that can correct a fraction $p$ of deletions. The value of $p^*(k)$ is not known for any value of $k$.
Clearly, $p^*(k) \le 1 - 1/k$: indeed, one can delete all but $n/k$ occurrences of the most frequent symbol in a
word to leave one of $k$ possible subsequences, and therefore only trivial codes with $k$ codewords can correct
a fraction $1 - 1/k$ of deletions. This trivial limit remains the best known upper bound on $p^*(k)$. We note that
this upper bound holds even for the simpler model of erasures where the locations of the missing symbols
are known at the receiver (this follows from the so-called Plotkin bound in coding theory).
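The pigeonhole argument behind the trivial bound can be made concrete with a short Python sketch. The function name `majority_subsequence` and the sample words are illustrative assumptions, not part of the paper; the sketch only shows that every word over $[k]$ collapses to one of $k$ constant words after at most $n - n/k$ deletions.

```python
from collections import Counter

def majority_subsequence(word, k):
    """Return a constant subsequence of `word` of length ceil(n/k).

    Some symbol of [k] occurs at least n/k times, so deleting everything
    except ceil(n/k) copies of it uses at most n - n/k deletions.
    """
    n = len(word)
    symbol, count = Counter(word).most_common(1)[0]
    keep = -(-n // k)  # ceil(n / k) without floating point
    assert count >= keep, "pigeonhole fails only if symbols lie outside [k]"
    return [symbol] * keep

# Hypothetical sample words over the alphabet [3].
words = [[1, 2, 1, 3, 2, 1], [3, 3, 2, 1, 3, 2], [2, 2, 2, 1, 3, 3]]
images = [tuple(majority_subsequence(w, 3)) for w in words]
```

Since only $k$ distinct images are possible, any code with more than $k$ codewords contains two codewords that the adversary can make indistinguishable.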

Whether the trivial upper bound $p^*(k) \le 1 - 1/k$ can be improved, or whether there are in fact codes
capable of correcting deletion fractions approaching $1 - 1/k$, is an outstanding open question concerning
deletion codes and the combinatorics of longest common subsequences. Perhaps the most notable case
is $k = 2$ (binary). The current best lower bound on $p^*(2)$ is around 0.17. This bound comes from
the random code, in view of the fact that the expected LCS of two random words in $\{0,1\}^n$ is at most
$0.8263n$ [8]. As the LCS of two random words in $\{0,1\}^n$ is at least $0.788n$, one cannot prove any lower
bound on $p^*(2)$ better than 0.22 using the random code. Kiwi, Loebl, and Matoušek [7] showed that, as
$k \to \infty$, we have $\mathbb{E}[\mathrm{LCS}(c,\tilde{c})] \sim \frac{2}{\sqrt{k}} n$ for two random words $c,\tilde{c} \in [k]^n$. This was used in [6] to deduce

$$p^*(k) \ge 1 - O(1/\sqrt{k}).$$

The above discussion only dealt with the existence of deletion codes. Turning to explicit and efficiently
decodable constructions, Schulman and Zuckerman [11] constructed constant-rate binary codes which are
efficiently decodable from a small constant fraction of worst-case deletions. This was improved in [6]; in
the new codes, the rate approaches 1. Specifically, it was shown that one can correct a fraction $\zeta > 0$ of
deletions with rate about $1 - O(\sqrt{\zeta})$. In terms of correcting a larger fraction of deletions, codes that are
efficiently decodable from a fraction $1 - \gamma$ of errors over a $\mathrm{poly}(1/\gamma)$ sized alphabet were also given in [6].

Our focus in this work is exclusively on the worst-case model of deletions. For random deletions, it 
is known that reliable communication at positive rate is possible for deletion fractions approaching 1 even 
in the binary case. We refer the reader interested in coding against random deletions to the survey by 
Mitzenmacher [9]. 

1.1 Our results 

Here we state our results informally, omitting the precise computational efficiency guarantees, and omitting 
the important technical properties of constructed codes related to the “span” of common subsequences (see 
Section 2 for the definition). The precise statements are in Subsection 4.2 and in Section 5. 

Our first result is a construction of codes which are combinatorially capable of correcting a larger fraction 
of deletions than was previously known to be possible. 

Theorem 1 (Informal). For all integers $k \ge 2$, $p^*(k) \ge 1 - \frac{2}{k+\sqrt{k}}$. Furthermore, for any desired $\varepsilon > 0$, there
is an efficiently constructible family of $k$-ary codes of rate $r(k,\varepsilon) > 0$ such that the LCS of any two distinct
codewords is less than a fraction $\frac{2}{k+\sqrt{k}} + \varepsilon$ of the code length. In particular, there are explicit binary codes
that can correct a fraction $(\sqrt{2}-1-\varepsilon) > 0.414 - \varepsilon$ of deletions, for any fixed $\varepsilon > 0$.

Note that, together with the trivial upper bound $p^*(k) \le 1 - 1/k$, the result pins down the asymptotics of
$1 - p^*(k)$ to $\Theta(1/k)$ as $k \to \infty$. Interestingly, our result shows that deletions are easier to correct than errors
(for worst-case models), as one cannot correct a fraction 1/4 of worst-case errors with positive rate.

In our second result we construct codes with the above guarantee together with an efficient algorithm to 
recover from deletions: 
Theorem 2 (Informal). For any integer $k \ge 2$ and any $\varepsilon > 0$, there is an efficiently constructible family of
$k$-ary codes of rate $r(k,\varepsilon) > 0$ that can be decoded in polynomial (in fact near-linear) time from a fraction
$1 - \frac{2}{k+\sqrt{k}} - \varepsilon$ of deletions.


1.2 Our techniques 

All our results are based on code concatenations, which use an outer code over a large alphabet with desirable
properties, and then further encode the codeword symbols by a judicious inner code. The inner code
comes in two variants: a clean, simpler form, and a dirty, more complicated form whose analysis is slightly
more involved and gives better bounds. For simplicity let us here describe the clean construction, which when
analyzed gives the slightly worse bound $1 - \frac{2}{k+1}$ as compared to $1 - \frac{2}{k+\sqrt{k}}$. This weaker bound appears in
the preliminary conference version [2] of this paper.

The innermost code consists of words of the form $(1^A 2^A \cdots k^A)^{L/A}$ for integers $A, L$ with $A$ dividing
$L$, where $a^A$ stands for the letter $a$ repeated $A$ times. Informally, we think of these words as oscillating
with amplitude $A$ (this can be made precise via Fourier transform for example, but we won't need it in
our analysis). The crucial property, that was observed in [4], is that two such words have a long common
subsequence only if their amplitudes are close. This property was also exploited in [3] to show a certain
weak limitation of deletion codes, namely that in any set of $t \ge k+2$ words in $[k]^n$, some two of them have
an LCS of length strictly greater than $n/k$ (see [3] for the precise bound).

The effective use of these codes as inner codes in a concatenation scheme relies on a property stronger
than absence of long common subsequences between codewords. Informally, the property amounts to absence
of long common subsequences between subwords of codewords. For the precise notion, consult the
definition of a span in the next section and the statement of Theorem 4 in the following section. Using this,
we are able to show that if the outer code has a small LCS value, then the LCS of the concatenated code
approaches a fraction $\frac{2}{k+1}$ of the block length.

For the outer code, the simplest choice is the random code. This gives the existential result (Theorem 14).
Using the explicit construction of codes to correct a large fraction of deletions over fixed alphabets from [6]
gives us a polynomial (in fact near-linear) time deterministic construction (Theorem 16). While the outer
code from [6] is also efficiently decodable from deletions, it is not clear how to exploit this to decode the
concatenated code efficiently.

To obtain codes that are also efficiently decodable, we employ another level of concatenation, using 
Reed-Solomon codes at the outermost level, and the above explicit concatenated code itself as the inner 
code. The combinatorial LCS property of these codes is established similarly, and is in fact easier, as we 
may assume (by indexing each position) that all symbols in an outer codeword are distinct, and therefore the 
corresponding inner codewords are distinct. To decode the resulting concatenated code, we try to decode 
the inner code (by brute-force) for many different contiguous subwords of the received subsequence. A 
small fraction of these are guaranteed to succeed in producing the correct Reed-Solomon symbol. The 
decoding is then completed via list decoding of Reed-Solomon codes. The approach here is inspired by the 
algorithm for list decoding binary codes from a deletion fraction approaching 1/2 in [6]. Our goal here is
to recover the correct message uniquely, but by virtue of the combinatorial guarantee, there can be at most 
one codeword with the received word as a subsequence, so we can go over the (short) list and identify the 
correct codeword. Note that list decoding is used as an intermediate algorithmic primitive even though our 
goal is unique decoding; this is similar to [5] that gave an algorithm to decode certain low-rate concatenated 
codes up to half the Gilbert-Varshamov bound via a list decoding approach. 
2 Preliminaries


A word is a sequence of symbols from a finite alphabet. For the problems of this paper, only the size of the
alphabet and the length of the word are important. So, we will often use $[k]$ for a canonical $k$-letter alphabet,
and consider the words indexed by $[n]$. In this case, the set of words of length $n$ over alphabet $[k]$ will be
denoted $[k]^n$. We treat symbols in a word as distinguishable. So, if $x$ denotes the second 1 in the word 21011
and we delete the subword 10, the variable $x$ now refers to the first 1 in the word 211.

Below we define some terminology about subsequences that we will use throughout the paper: 

• A subsequence in a word $w$ is any word obtained from $w$ by deleting one or more symbols. In contrast,
a subword is a subsequence made of several consecutive symbols of $w$.

• The span of a subsequence $w'$ in a word $w$ is the length of the smallest subword containing the
subsequence. We denote it by $\mathrm{span}_w w'$, or simply by $\mathrm{span}\, w'$ when no ambiguity can arise.

• A common subsequence between words $w_1$ and $w_2$ is a pair $(w_1', w_2')$ of subsequences $w_1'$ in $w_1$ and
$w_2'$ in $w_2$ that are equal as words, i.e., $\mathrm{len}\, w_1' = \mathrm{len}\, w_2'$ and the $i$'th symbols of $w_1'$ and $w_2'$ are equal for
each $i$, $1 \le i \le \mathrm{len}\, w_1'$.

• For words $w_1, w_2$, we denote by $\mathrm{LCS}(w_1,w_2)$ the length of the longest common subsequence of $w_1$
and $w_2$, i.e., the largest $j$ for which there is a common subsequence between $w_1$ and $w_2$ of length $j$.
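As an illustration of these definitions, the following Python sketch represents a subsequence by the (0-based) positions it occupies in the host word. The helper names `span` and `is_common_subsequence` and the position lists are assumptions made for this example only.

```python
def span(positions):
    """Length of the smallest subword that contains all chosen positions."""
    return positions[-1] - positions[0] + 1 if positions else 0

def is_common_subsequence(w1, pos1, w2, pos2):
    """Check that pos1 in w1 and pos2 in w2 pick out equal words."""
    def increasing(pos):
        return all(a < b for a, b in zip(pos, pos[1:]))
    return (increasing(pos1) and increasing(pos2)
            and len(pos1) == len(pos2)
            and all(w1[i] == w2[j] for i, j in zip(pos1, pos2)))

w1, w2 = "13321", "22131"
pos1, pos2 = [0, 1, 4], [2, 3, 4]  # both position lists spell the word "131"
total_span = span(pos1) + span(pos2)
```

Here the same word "131" has span 5 inside `w1` but only span 3 inside `w2`, which is the kind of asymmetry the later analysis keeps track of.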

A code $C$ of block length $n$ over the alphabet $[k]$ is simply a subset of $[k]^n$. We will also call such codes
$k$-ary codes, with binary codes referring to the $k = 2$ case. The rate of $C$ equals $\frac{\log |C|}{n \log k}$.

For a code $C \subseteq [k]^n$, its LCS value is defined as

$$\mathrm{LCS}(C) = \max_{c_1 \ne c_2 \in C} \mathrm{LCS}(c_1, c_2).$$

Note that a code $C \subseteq [k]^n$ is capable of recovering from $t$ worst-case deletions if and only if $\mathrm{LCS}(C) < n - t$.
We define the span of a common subsequence $(w_1', w_2')$ of words $w_1$ and $w_2$ as

$$\mathrm{span}(w_1', w_2') = \mathrm{span}_{w_1} w_1' + \mathrm{span}_{w_2} w_2'.$$

The span will play an important role in our analysis of $\mathrm{LCS}(C)$ of the codes $C$ we construct, by virtue of the
fact that if $\mathrm{span}(w_1', w_2') \ge b \cdot \mathrm{len}\, w_1'$ for every common subsequence of $w_1, w_2 \in [k]^n$, then $\mathrm{LCS}(w_1, w_2) \le 2n/b$
(because the span of any common subsequence is at most $2n$).
Our result will be based on a construction for which we can take $b \approx k + \sqrt{k}$ for long enough common
subsequences of any distinct pair of codewords.
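For small examples, the LCS value of a code and the number of deletions it tolerates can be computed directly. The sketch below uses the textbook dynamic program for LCS; the function names and the three-word sample code are assumptions for illustration and play no role in the paper's constructions.

```python
from itertools import combinations

def lcs(u, v):
    """Length of a longest common subsequence, by the standard O(|u||v|) DP."""
    prev = [0] * (len(v) + 1)
    for a in u:
        cur = [0]
        for j, b in enumerate(v, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_value(code):
    """LCS(C): the maximum LCS over pairs of distinct codewords."""
    return max(lcs(c1, c2) for c1, c2 in combinations(code, 2))

def max_correctable_deletions(code):
    """Largest t with LCS(C) < n - t, i.e. t = n - LCS(C) - 1."""
    n = len(code[0])
    return n - lcs_value(code) - 1

# A hypothetical toy code of block length 6.
code = ["111000", "000111", "101010"]
t = max_correctable_deletions(code)
```

The brute-force pairwise maximum is quadratic in the code size, so it is only meant for checking toy examples against the criterion $\mathrm{LCS}(C) < n - t$.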

Concatenated codes. Our results heavily use the simple but useful idea of code concatenation. Given
an outer code $C_{out} \subseteq [Q]^n$, and an injective map $\tau: [Q] \to [q]^m$ defining the encoding function of an inner
code $C_{in}$, the concatenated code $C_{concat} \subseteq [q]^{nm}$ is obtained by composing these codes as follows. If
$(c_1, c_2, \dots, c_n) \in [Q]^n$ is a codeword of $C_{out}$, the corresponding codeword in $C_{concat}$ is $(\tau(c_1), \dots, \tau(c_n)) \in
[q]^{nm}$. The words $\tau(c_i) \in C_{in}$ will be referred to as the inner blocks of the concatenated codeword, with the
$i$'th block corresponding to the $i$'th outer codeword symbol.
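Concatenation is a purely mechanical operation, as the following Python sketch shows. The dictionary `tau` and the outer codewords are made-up toy data (with $Q = 3$, $q = 2$, $m = 4$), chosen only to illustrate the definition.

```python
def concatenate(outer_code, tau):
    """Replace each outer symbol by its inner block tau[symbol].

    outer_code: list of tuples over [Q]; tau: dict mapping each symbol of [Q]
    to an inner word (a string over [q]) of common length m.
    """
    block_lengths = {len(block) for block in tau.values()}
    assert len(block_lengths) == 1, "all inner blocks must have the same length m"
    assert len(set(tau.values())) == len(tau), "tau must be injective"
    return ["".join(tau[sym] for sym in codeword) for codeword in outer_code]

tau = {1: "0011", 2: "0101", 3: "1100"}   # hypothetical inner code
outer = [(1, 2, 3), (3, 1, 2)]            # hypothetical outer codewords
concat = concatenate(outer, tau)
```

Each concatenated codeword has length $nm$, and the number of codewords is unchanged, so the rate drops only by the constant factor determined by the inner code.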
3 Alphabet reduction for deletion codes


Fix $k$ to be the alphabet size of the desired deletion code. We shall show how to turn words over a $K$-letter
alphabet, for $K \gg k$, without large common subsequence into words over a $k$-letter alphabet without large
common subsequence. More specifically, for any $\varepsilon > 0$ and large enough integer $K = K(\varepsilon)$, we give a
method to transform a deletion code $C_1 \subseteq [K]^n$ with $\mathrm{LCS}(C_1) \ll \varepsilon n$ into a deletion code $C_2 \subseteq [k]^N$ with
$\mathrm{LCS}(C_2) \le \left(\frac{2}{k+\sqrt{k}} + \varepsilon\right)N$. The transformation lets us transform a crude dependence between the alphabet
size of the code $C_1$ and its LCS value (i.e., between $K$ and $\varepsilon$), into a quantitatively strong one, namely
$\mathrm{LCS}(C_2) \approx \frac{2}{k+\sqrt{k}} N$. The code $C_2$ is obtained by concatenating $C_1$ with an inner $k$-ary code with
$K$ codewords, and therefore has the same cardinality as $C_1$. The block length $N$ of $C_2$ will be much larger
than $n$, but the ratio $N/n$ will be bounded as a function of $k$, $K$, and $\varepsilon$. The rate of $C_2$ will thus only be a
constant factor smaller than that of $C_1$.

Specifically, we prove the following. 

Theorem 3. Let $C_1 \subseteq [K]^n$ be a code with $\mathrm{LCS}(C_1) = \gamma n$, and let $k \ge 2$ be an integer. Then there exists an
integer $T = T(K,\gamma,k)$ satisfying $T \le O((2k/\gamma)^{2K+1})$, and an injective map $\tau: [K] \to [k]^T$ such that the code
$C_2 \subseteq [k]^N$ for $N = nT$ obtained by replacing each symbol in codewords of $C_1$ by its image under $\tau$ has the
following property: if $s$ is a common subsequence between two distinct codewords $c, \tilde{c} \in C_2$, then

$$\mathrm{span}\, s \ge (k+\sqrt{k})\, \mathrm{len}\, s - 5\gamma k N. \qquad (1)$$

In particular, since $\mathrm{span}\, s \le 2N$, we have $\mathrm{LCS}(C_2) \le \frac{2 + 5\gamma k}{k+\sqrt{k}} N < \left(\frac{2}{k+\sqrt{k}} + 5\gamma\right) N$.


Thus, one can construct codes over a size $k$ alphabet with LCS value approaching $\frac{2}{k+\sqrt{k}}$ by starting with
an outer code with LCS value $\gamma \to 0$ over any fixed size alphabet, and concatenating it with a constant-sized
map. The span property will be useful in concatenated schemes to get longer, efficiently decodable codes.

The key to the above construction is the inner map, which comes in two variants, one "clean" and one
"dirty" form. The former is simpler to describe and we choose to do this first.


3.1 The clean construction 

The aim of the clean construction is to prove the following:

Theorem 4. Let $C_1 \subseteq [K]^n$ be a code with $\mathrm{LCS}(C_1) = \gamma n$, and let $k \ge 2$ be an integer. Then there exists an
integer $T = T(K,\gamma,k)$ satisfying $T \le 32 \cdot (2k/\gamma)^K$, and an injective map $\tau: [K] \to [k]^T$ such that the code
$C_2 \subseteq [k]^N$ for $N = nT$ obtained by replacing each symbol in codewords of $C_1$ by its image under $\tau$ has the
following property: if $s$ is a common subsequence between two distinct codewords $c, \tilde{c} \in C_2$, then

$$\mathrm{span}\, s \ge (k+1)\, \mathrm{len}\, s - 4\gamma k N.$$

In particular, since $\mathrm{span}\, s \le 2N$, we have $\mathrm{LCS}(C_2) \le \frac{2 + 4\gamma k}{k+1} N < \left(\frac{2}{k+1} + 4\gamma\right) N$.

We start by describing the way to encode symbols from the alphabet $[K]$ as words over $[k]$ that underlies
Theorem 4. Let $L$ be a constant to be chosen later. For an integer $A$ dividing $L$, define the word of "amplitude
$A$" to be

$$f_A = (1^A 2^A \cdots k^A)^{L/A}, \qquad (2)$$
where $a^A$ stands for the letter $a$ repeated $A$ times. The crucial property of these words is that $f_A$ and $f_B$ have
no long common subsequence if $B/A$ is large (or small); for the proof see one of [4, 3]. In the present work,
we will need a more general "asymmetric" version of this observation: we will need to analyze common
subsequences in subwords of $f_A$ and $f_B$ (which may be of different lengths).

Let $R \ge k$ be an integer to be chosen later. For a word $w$ over alphabet $[K]$ denote by $\hat{w}$ the word obtained
from $w$ by applying the substitution

$$l \in [K] \mapsto f_{R^{l-1}} \qquad (3)$$

to each symbol of $w$. Note that $\mathrm{len}\, \hat{w} = kL\, \mathrm{len}\, w$. If a symbol $x \in \hat{w}$ is obtained by expanding symbol $y \in w$,
then we say that $y$ is a parent of $x$.
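The clean inner encoding is easy to write down explicitly. The following Python sketch builds the word of amplitude $A$ and expands a word over $[K]$ symbol by symbol; the function names `clean_word` and `expand` and the small parameter values are assumptions chosen for illustration (the proofs use much larger $R$ and $L$).

```python
def clean_word(A, L, k):
    """The word (1^A 2^A ... k^A)^(L/A) of amplitude A; requires A to divide L."""
    assert L % A == 0
    period = [sym for sym in range(1, k + 1) for _ in range(A)]
    return period * (L // A)

def expand(w, R, L, k):
    """Replace each symbol l of w (over [K]) by the word of amplitude R^(l-1)."""
    return [x for sym in w for x in clean_word(R ** (sym - 1), L, k)]

# Toy parameters: k = 2, R = 2, L = 8, so amplitudes 1, 2, 4 all divide L.
example = expand([1, 3, 2], R=2, L=8, k=2)
```

Every symbol of the outer word becomes a block of exactly $kL$ symbols, so the expanded word has length $kL$ times the original length.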

3.1.1 Analysis of clean construction 

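Lemma 5. For a natural number $P$, let $f_P^\infty$ be the (infinite) word

$$f_P^\infty = (1^P 2^P \cdots k^P)^\infty.$$

Let $A, B$, where $kA \le B$, be natural numbers, and suppose $s = (w_1', w_2')$ is a common subsequence between
$f_A^\infty$ and $f_B^\infty$. Then

$$\mathrm{span}\, s \ge \left(k + 1 - \frac{kA}{B}\right) \mathrm{len}\, s - 2(A+B).$$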

Proof. The words $f_A^\infty$ and $f_B^\infty$ are concatenations of chunks, which are subwords of the form $l^A$ and $l^B$
respectively. A chunk in $f_A^\infty$ is spanned by the subsequence $w_1'$ if the span of $w_1'$ contains at least one symbol
of the chunk. Similarly, we define chunks spanned by $w_2'$ in $f_B^\infty$. We will estimate how many chunks are
spanned by $w_1'$ and by $w_2'$.

As a word, a common subsequence is of the form $k_1^{p_1} k_2^{p_2} \cdots k_u^{p_u}$ where $k_i \ne k_{i+1}$ and the exponents
are positive. The subsequence $k_i^{p_i}$ spans at least $k \left\lceil \frac{p_i - A}{A} \right\rceil + 1$ chunks in $f_A^\infty$. Similarly, $k_i^{p_i}$ spans at least
$k \left\lceil \frac{p_i - B}{B} \right\rceil + 1$ chunks in $f_B^\infty$. Therefore the total number of symbols in chunks spanned by $k_i^{p_i}$ in both $f_A^\infty$ and
$f_B^\infty$ is at least

$$\varphi(p_i) = A\left(k \left\lceil \frac{p_i - A}{A} \right\rceil + 1\right) + B\left(k \left\lceil \frac{p_i - B}{B} \right\rceil + 1\right).$$

We then estimate $\varphi(p_i)$ according to whether $p_i \le B$:

$$\varphi(p_i) \ge \begin{cases} k(p_i - A) + B & \text{if } p_i \le B, \\ k(p_i - A) + k(p_i - B) + B & \text{if } p_i > B. \end{cases}$$

In both cases we have

$$\varphi(p_i) \ge \left(k + 1 - \frac{kA}{B}\right) p_i.$$

Note that the chunks spanned by $k_i^{p_i}$ are distinct from chunks spanned by $k_{i'}^{p_{i'}}$ for $i \ne i'$. So, the total
number of symbols in all chunks spanned by the subsequence $s$ in both $f_A^\infty$ and $f_B^\infty$ is at least

$$\left(k + 1 - \frac{kA}{B}\right) \mathrm{len}\, s.$$

The total span of $s$ might be smaller since the first and the last chunks in each of $f_A^\infty$ and $f_B^\infty$ might not be
fully spanned. Subtracting $2(A + B)$ to account for that gives the stated result. □
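One consequence of Lemma 5 can be checked numerically on small instances. The finite words of amplitude $A$ and $B$ and length $kL$ are subwords of the infinite words, and any common subsequence of them has total span at most $2kL$, so the lemma caps its length at $(2kL + 2(A+B))/(k + 1 - kA/B)$. The Python sketch below (helper names and parameter choices are assumptions) compares this cap with the exact LCS computed by dynamic programming.

```python
def clean_word(A, L, k):
    """The finite word (1^A 2^A ... k^A)^(L/A); requires A to divide L."""
    assert L % A == 0
    period = [sym for sym in range(1, k + 1) for _ in range(A)]
    return period * (L // A)

def lcs(u, v):
    """Length of a longest common subsequence (standard quadratic DP)."""
    prev = [0] * (len(v) + 1)
    for a in u:
        cur = [0]
        for j, b in enumerate(v, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lemma5_lcs_cap(A, B, L, k):
    """Upper bound on len s implied by Lemma 5 together with span s <= 2kL."""
    return (2 * k * L + 2 * (A + B)) / (k + 1 - k * A / B)

k, L = 2, 32
report = []
for A, B in [(1, 2), (1, 8), (2, 16), (4, 8)]:
    assert k * A <= B  # hypothesis of Lemma 5
    length = lcs(clean_word(A, L, k), clean_word(B, L, k))
    report.append((A, B, length, lemma5_lcs_cap(A, B, L, k)))
```

For such short words the additive term $2(A+B)$ is not negligible, so the cap is loose; it becomes meaningful only when $L$ is large compared with the amplitudes, which is how the parameters are chosen in the proofs.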


Let $(w_1', w_2')$ be a common subsequence between $\hat{w}_1$ and $\hat{w}_2$. We say that the $i$'th symbol in $(w_1', w_2')$
is well-matched if the parents of $w_1'[i]$ and of $w_2'[i]$ are the same letter of $[K]$. A common subsequence is
badly-matched if none of its symbols are well-matched; see Figure 1 below for an example.

Figure 1: A badly-matched common subsequence between $\hat{w}_1$ and $\hat{w}_2$ for $w_1 = 13321$ and $w_2 = 22131$


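Lemma 6. Suppose $w_1, w_2$ are words over alphabet $[K]$ and $s = (w_1', w_2')$ is a badly-matched common
subsequence between $\hat{w}_1$ and $\hat{w}_2$ as defined in (3). Then

$$\mathrm{span}\, w_1' + \mathrm{span}\, w_2' \ge \left(k + 1 - \frac{k}{R} - \frac{8R^{K-1}}{L}\right) \mathrm{len}\, s - 16R^{K-1}.$$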


Proof. We subdivide the common subsequence $s$ into subsequences $s_1, \dots, s_r$ such that, for each $i = 1, \dots, r$
and each $j = 1, 2$, the symbols matched by $s_i$ in $w_j'$ belong to the expansion of the same symbol in $w_j$. We
choose the subdivision to be a coarsest one with this property (see Figure 2 below for an example). That
implies that pairs of symbols of $w_1$ and $w_2$ matched by $s_i$ and by $s_{i+1}$ are different. In particular, expansions
of at least $r - 4$ symbols of $w_1$ and $w_2$ (except possibly the expansions of the leftmost and rightmost symbols
of each of them) are fully contained in the spans of $w_1'$ and $w_2'$. Therefore, we have

$$Lk(r - 4) \le \mathrm{span}\, s.$$

Since $(w_1', w_2')$ is badly-matched, applying the preceding lemma to each $s_i$ (whose two amplitudes are at most
$R^{K-1}$ and differ by a factor of at least $R$) we then have

$$\mathrm{span}\, s \ge \left(k + 1 - \frac{k}{R}\right) \mathrm{len}\, s - 4rR^{K-1} \ge \left(k + 1 - \frac{k}{R}\right) \mathrm{len}\, s - 4R^{K-1}\left(\frac{\mathrm{span}\, s}{Lk} + 4\right).$$

The lemma then follows from collecting together the two terms involving $\mathrm{span}\, s = \mathrm{span}\, w_1' + \mathrm{span}\, w_2'$, and then
dividing by $1 + 4R^{K-1}/(Lk)$. □



Figure 2: Partition of the common subsequence from Figure 1 into subsequences as in the proof of Lemma 6

The next step is to drop the assumption in Lemma 6 that the common subsequence is badly-matched.
By doing so we incur an error term involving $\mathrm{LCS}(w_1, w_2)$.
Lemma 7. Suppose $w_1, w_2$ are words over alphabet $[K]$ and $s = (w_1', w_2')$ is a common subsequence between
$\hat{w}_1$ and $\hat{w}_2$. Then

$$\mathrm{span}\, s \ge \left(k + 1 - \frac{k}{R} - \frac{8R^{K-1}}{L}\right) \mathrm{len}\, s - 2Lk(k+1) \cdot \mathrm{LCS}(w_1, w_2) - 16R^{K-1}.$$


Proof. Without loss of generality, the subsequence $s$ is locally optimal, i.e., every alteration of $s$ that increases $\mathrm{len}\, s$ also
increases $\mathrm{span}\, s$. Define an auxiliary bipartite graph $G$ whose two parts are the symbols in $w_1$ and the
symbols in $w_2$. For each well-matched symbol in $s$ we join the parent symbols in $w_1$ and $w_2$ by an edge.

We may assume that each vertex in $G$ has degree at most 2. Indeed, suppose a symbol $x \in w_1$ is adjacent
to three symbols $y_1, y_2, y_3 \in w_2$ with $y_2$ being in between $y_1$ and $y_3$. Then we alter $s$ by first removing all
matches between $x$ and $y_1, y_2, y_3$, and then completely matching $x$ with $y_2$. The alteration does not increase
$\mathrm{span}\, s$, and the result is a common subsequence that is at least as long as $s$, and whose auxiliary graph has
fewer edges. We can then repeat this process until no vertex has degree exceeding 2.

Consider a maximum-sized matching in $G$. On one hand, it has at most $\mathrm{LCS}(w_1, w_2)$ edges. On the
other hand, since the maximum degree of $G$ is at most 2, the maximum-sized matching has at least $|E(G)|/2$
edges. Hence, $|E(G)| \le 2\, \mathrm{LCS}(w_1, w_2)$.

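Remove from $s$ all well-matched symbols to obtain a common subsequence $s'$. The new subsequence
satisfies

$$\mathrm{len}\, s' \ge \mathrm{len}\, s - Lk \cdot |E(G)| \ge \mathrm{len}\, s - 2Lk \cdot \mathrm{LCS}(w_1, w_2).$$

It is also clear that $s'$ is a badly-matched common subsequence. From the previous lemma

$$\mathrm{span}\, s' \ge \left(k + 1 - \frac{k}{R} - \frac{8R^{K-1}}{L}\right) \mathrm{len}\, s - 2Lk(k+1) \cdot \mathrm{LCS}(w_1, w_2) - 16R^{K-1}.$$

Since $\mathrm{span}\, s \ge \mathrm{span}\, s'$, the lemma follows. □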


We are now ready to prove Theorem 4 by picking parameters suitably. 

Proof of Theorem 4. Recall that we are starting with a code $C_1 \subseteq [K]^n$ with $\mathrm{LCS}(C_1) = \gamma n$. Given $\gamma > 0$ and
an integer $k \ge 2$, pick parameters

$$R = \left\lceil \frac{2k}{\gamma} \right\rceil \quad \text{and} \quad L = 16R^K$$

in the construction (2) and (3). Define $T = kL$ and $\tau: [K] \to [k]^T$ as $\tau(l) = f_{R^{l-1}}$, and let $C_2 \subseteq [k]^N$, where
$N = nkL$, be the code obtained as in the statement of Theorem 4. Note that $T \le 32 \cdot (2k/\gamma)^K$ by our choice
of parameters.

By Lemma 7, we can conclude that any common subsequence $s$ of two distinct codewords of $C_2$ satisfies

$$\mathrm{span}\, s \ge (k + 1 - \gamma)\, \mathrm{len}\, s - 2(k+1)\gamma N - \gamma N.$$

Since $\mathrm{len}\, s \le N$ and $k \ge 2$, the right hand side is at least $(k+1)\, \mathrm{len}\, s - 4k\gamma N$, as desired. □

Remark 1 (Bottleneck for analysis). We now explain why the analysis in Theorem 4 is limited to proving
correctability of a 1/3 fraction of deletions for binary codes (a similar argument holds for larger alphabet
size $k$). Imagine subwords of length 3 of $w_1, w_2 \in [K]^n$ of the form $abc$ and $def$ respectively, where $d >
a, b$ and $c > e, f$. Then the word $f_{R^{d-1}}$ can be matched fully with $f_{R^{a-1}} f_{R^{b-1}}$ (because the latter strings
oscillate at a higher frequency than $f_{R^{d-1}}$) and similarly $f_{R^{c-1}}$ can be matched fully with $f_{R^{e-1}} f_{R^{f-1}}$. Thus
we can find a common subsequence of length $4L$ between the encoded bit strings $f_{R^{a-1}} f_{R^{b-1}} f_{R^{c-1}} \in [2]^{6L}$ and
$f_{R^{d-1}} f_{R^{e-1}} f_{R^{f-1}} \in [2]^{6L}$, even if $abc$ and $def$ share no common subsequence.
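The phenomenon in Remark 1 can be observed on a toy instance. In the Python sketch below, the binary parameters $R = 2$, $L = 32$ and the letters $a, b, c = 1, 2, 6$ and $d, e, f = 5, 3, 4$ are illustrative assumptions satisfying $d > a, b$ and $c > e, f$; the helper names are likewise hypothetical.

```python
def clean_word(A, L, k=2):
    """The word (1^A 2^A ... k^A)^(L/A) of amplitude A; requires A to divide L."""
    assert L % A == 0
    period = [sym for sym in range(1, k + 1) for _ in range(A)]
    return period * (L // A)

def is_subsequence(u, v):
    """Greedy left-to-right test of whether u is a subsequence of v."""
    it = iter(v)
    return all(x in it for x in u)

R, L = 2, 32
def block(letter):
    return clean_word(R ** (letter - 1), L)

a, b, c = 1, 2, 6
d, e, f = 5, 3, 4

# The low-frequency block for d embeds into the two high-frequency blocks for a, b,
# and the low-frequency block for c embeds into the blocks for e, f.
first_half = is_subsequence(block(d), block(a) + block(b))
second_half = is_subsequence(block(c), block(e) + block(f))
common_length = len(block(d)) + len(block(c))  # length of the resulting common subsequence
```

When both embeddings succeed, concatenating them gives a common subsequence of length $4L$ between two encoded words of length $6L$, although the outer words 126 and 534 have no letter in common.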

3.2 Dirty construction 

We now turn to the more complicated “dirty” construction in which small runs of dirt are interspersed in the 
long runs of a single symbol from the clean construction. 


3.2.1 Dirty construction, binary case 


To convey the intuition for the dirty construction let us look more closely at what happened in the binary
case. We were looking for subsequences of

$$f_A^\infty = (1^A 2^A)^*$$

and

$$f_B^\infty = (1^B 2^B)^*$$

where both $A$ and $B$ are large numbers but $B$ is much larger than $A$. We are interested in subsequences with
small span. Looking more closely at the proof of Lemma 5 we see that such subsequences are obtained by
taking every symbol of $f_B^\infty$ and discarding essentially half the symbols of $f_A^\infty$ so as not to interrupt the very
long runs in $f_B^\infty$. Now suppose we introduce some "dirt" in $f_B^\infty$ by introducing, in the very long stretches of
1's, some infrequent 2's, say a 2 every 10th symbol (and similarly some infrequent 1's in the long stretches
of 2's). Then, during construction of the LCS, when running into such a sporadic 2 we can either try to
include it or discard it. As $A$ is a large number it is easy to see that while we are matching a 1-segment of
$f_A^\infty$ we cannot profit by matching the sporadic 2's. It is also not difficult to see that while passing through a
2-segment of $f_A^\infty$ it is not profitable to match more than one sporadic 2, as matching two consecutive sporadic
2's forces us to drop the ten 1's in between the two matched 2's in $f_B^\infty$. The net effect is that introducing
some dirt hardly enables us to expand the LCS but does increase the span. We need to introduce dirt in all
codewords and it should not look too similar in any two codewords. The way to achieve this is by introducing
such dirty runs of different but short lengths in all codewords. Let us turn to a more formal description.

For the sake of readability we assume below that some real numbers defined are integers. Rounding
these numbers to the closest integer only introduces lower order error terms. It is also not difficult to see
that we can pick parameters such that all numbers are indeed integers.

Let $c$ be such that $0 \le c \le \sqrt{2} - 1$. The reason for the upper limit on $c$ will be clarified in Remark 2
after the analysis. We define "$M$ dirty ones at amplitude $a$" to be the string

$$(1^a 2^{ca})^{M/((1+c)a)}$$

and let us write this as $1_{M,a}$, leaving $c$ implicit. We have an analogous string $2_{M,a}$ and we allow $M = \infty$ with
the natural interpretation. Remember that in our clean solution, $i$ was coded by

$$f_{R^{i-1}} = (1^{R^{i-1}} 2^{R^{i-1}})^{L/R^{i-1}}.$$

In the dirty construction we replace this by

$$g_i = \left(1_{R^{K+i+1}, R^{K-i}}\, 2_{R^{K+i+1}, R^{K-i}}\right)^{L/R^{K+i+1}}, \qquad (4)$$
where $R$ is an integer that can be written in the form $(1 + c)t$ for an integer $t$, and

$$L = R^{2K+1}. \qquad (5)$$

We introduce dirt where the amplitude of the dirt decreases with $i$. We call a string of the form $1_{R^{K+i+1}, R^{K-i}}$
or $2_{R^{K+i+1}, R^{K-i}}$ a segment of $g_i$. The reason for the general increase in length compared with the clean
construction is to accommodate dirt of frequencies that are well separated.
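To make the shape of the dirty words concrete, here is a small Python sketch. It passes the dirt run length (the quantity $ca$) as an explicit integer `dirt_len` rather than deriving it from $c$, and uses toy values of the segment length $M$, amplitude $a$, and $L$; these simplifications, and the names `dirty_run` and `dirty_word`, are assumptions of the example, not the paper's parameter choices.

```python
def dirty_run(main, dirt, M, a, dirt_len):
    """M symbols made of blocks: `a` copies of `main`, then `dirt_len` copies of `dirt`.

    With dirt_len = c * a this is the string "M dirty `main`s at amplitude a".
    """
    period = [main] * a + [dirt] * dirt_len
    assert M % len(period) == 0
    return period * (M // len(period))

def dirty_word(M, a, dirt_len, L):
    """Repeat (dirty ones, dirty twos) L / M times, giving a binary word of length 2L."""
    assert L % M == 0
    pair = dirty_run(1, 2, M, a, dirt_len) + dirty_run(2, 1, M, a, dirt_len)
    return pair * (L // M)

# Toy instance: a = 5 and dirt_len = 2 correspond to c = 0.4 < sqrt(2) - 1.
word = dirty_word(M=28, a=5, dirt_len=2, L=56)
```

By symmetry the word contains equally many 1's and 2's, and a fraction $c/(1+c)$ of each long run consists of dirt.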

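Lemma 8. Let $w_1$ be the string $1_{\infty,a}$ (or $2_{\infty,a}$) and let $s$ be a common subsequence of $w_1$ and $w_2 = 1_{b_1,b_2} 2_{b_1,b_2}$. Then

$$\mathrm{span}_{w_1} s + 2b_1 \ge (3 + c)\, \mathrm{len}\, s - \frac{4ab_1}{b_2}.$$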

Proof. As $w_2$ is symmetric in 1 and 2 we can assume that $w_1 = 1_{\infty,a}$. Note that $w_2$ consists of substrings of
the form $1^{b_2}$, $2^{cb_2}$, $2^{b_2}$ and $1^{cb_2}$, with $b_1/((1+c)b_2)$ copies of each. For the $i$'th substring of ones (regardless of whether it
is of length $b_2$ or $cb_2$), let us assume that $s_i$ of its 1's are contained in $s$. Then the span of this subsequence in $w_1$
is at least $(s_i/a - 1)(1+c)a$. Similarly, if the $i$'th string of 2's contains $t_i$ symbols from $s$ then its span in $w_1$
is at least $(t_i/(ca) - 1)(1+c)a$. Summing these inequalities, if $S$ is the total number of 1's in $s$ and $\tilde{S}$ is the
number of 2's, then the span of $s$ in $w_1$ is at least

$$(1+c)S + (1+c)\tilde{S}/c - \frac{4ab_1}{b_2},$$

where the last term comes because we lose $(1+c)a$ in the span for each substring of identical symbols in
$w_2$ and there are $4b_1/((1+c)b_2)$ such substrings.

As the length of $s$ is $S + \tilde{S}$ it is sufficient to establish that

$$(1+c)S + (1+c)\tilde{S}/c + 2b_1 \ge (3+c)(S + \tilde{S}). \qquad (6)$$

We know that both $S$ and $\tilde{S}$ are in the range $[0, b_1]$. Since $0 \le c \le \sqrt{2} - 1$ we have $(1+c)/c \ge 3 + c$ and
thus it is sufficient to establish (6) for $\tilde{S} = 0$, but in this case it follows from $S \le b_1$. □


The above lemma is the main ingredient in establishing the following lemma.

Lemma 9. Let $s$ be a common subsequence of $g_i$ and $g_j$ for $i < j$. Then, provided $R \ge 10$,

$$\left(1 + \frac{2}{R}\right) \mathrm{span}_{g_i} s + \mathrm{span}_{g_j} s \ge (3 + c)\, \mathrm{len}\, s - \frac{10L}{R}.$$

Proof. We have that $g_i$ consists of $L/R^{K+i+1}$ substrings, each of the form

$$1_{R^{K+i+1}, R^{K-i}}\, 2_{R^{K+i+1}, R^{K-i}}.$$

Now partition $s$ into substrings $s^{(l)}$ according to how it intersects these substrings of $g_i$. The number of such
strings is at most $2 + (\mathrm{span}_{g_i} s)/(2R^{K+i+1})$. We want to apply Lemma 8 and we need to address the fact that
each $s^{(l)}$ might intersect more than one segment of $g_j$ (recall that a segment of $g_j$ is a substring of the form
$1_{R^{K+j+1}, R^{K-j}}$ or $2_{R^{K+j+1}, R^{K-j}}$). As $g_j$ only has $2L/R^{K+j+1}$ different segments, by refining the partition slightly
we can obtain substrings $s^{(k)}$ for $k = 1, \dots, p$ with $p \le 2 + (\mathrm{span}_{g_i} s)/(2R^{K+i+1}) + 2L/R^{K+j+1}$, where each
$s^{(k)}$ satisfies the hypothesis of Lemma 8 with $a = R^{K-j}$, $b_1 = R^{K+i+1}$ and $b_2 = R^{K-i}$. We therefore obtain
the inequality

$$\mathrm{span}_{g_j} s^{(k)} + 2R^{K+i+1} \ge (3 + c)\, \mathrm{len}\, s^{(k)} - 4R^{i-j} R^{K+i+1}. \qquad (7)$$

We have a total of $p$ inequalities, and as $\mathrm{span}_{g_j} s \ge \sum_k \mathrm{span}_{g_j} s^{(k)}$ and $\mathrm{len}\, s = \sum_k \mathrm{len}\, s^{(k)}$, summing (7) for the
$p$ values of $k$ gives

$$\mathrm{span}_{g_j} s + 2pR^{K+i+1} \ge (3 + c)\, \mathrm{len}\, s - 4pR^{i-j} R^{K+i+1}.$$

Now as $p \le 2 + \mathrm{span}_{g_i} s/(2R^{K+i+1}) + 2L/R^{K+j+1}$ we can conclude that

$$\mathrm{span}_{g_j} s + (1 + 2R^{i-j})\, \mathrm{span}_{g_i} s \ge (3 + c)\, \mathrm{len}\, s - \left(4R^{K+i+1} + 4LR^{i-j} + 8R^{i-j} R^{K+i+1} + 8R^{2(i-j)} L\right)$$

and using $R \ge 10$, $R^{K+i+1} \le L/R$ and $i < j$, the lemma follows. □

Let us slightly abuse notation and in this section let $\hat{w}$ be the word obtained from a word $w$ by applying the
substitution

$$l \in [K] \mapsto g_l \qquad (8)$$

to each symbol of $w$, as opposed to (3). As Lemma 9 tells us that subsequences of codings of unequal
symbols have a large span, we have the following analog of Lemma 6.

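Lemma 10. Suppose $w_1, w_2$ are words over alphabet $[K]$ and $s = (w_1', w_2')$ is a badly-matched common
subsequence between $\hat{w}_1$ and $\hat{w}_2$ as defined in (8). Then

$$\mathrm{span}\, w_1' + \mathrm{span}\, w_2' \ge \left(3 + c - \frac{28}{R}\right) \mathrm{len}\, s - \frac{40L}{R}.$$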


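Proof. We use the same subdivision as in the proof of Lemma 6. We have

$$2L(r - 4) \le \mathrm{span}\, s.$$

Since $(w_1', w_2')$ is badly-matched, by the preceding lemma we then have

$$\left(1 + \frac{2}{R}\right) \mathrm{span}\, s \ge (3 + c)\, \mathrm{len}\, s - \frac{10rL}{R} \ge (3 + c)\, \mathrm{len}\, s - \frac{10L}{R}\left(\frac{\mathrm{span}\, s}{2L} + 4\right).$$

The lemma then follows from collecting together the two terms involving $\mathrm{span}\, s = \mathrm{span}\, w_1' + \mathrm{span}\, w_2'$, and then
dividing by $1 + 7/R$. □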


The transition to allow some well-matched symbols is done as in the clean construction and we get the
lemma below. The proof is analogous to that of Lemma 7: we remove the well-matched
symbols, which shortens $s$ by at most $4L \cdot \mathrm{LCS}(w_1, w_2)$, and the rest of the proof is essentially identical.

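Lemma 11. Suppose $w_1, w_2$ are words over alphabet $[K]$ and $s = (w_1', w_2')$ is a common subsequence between
$\hat{w}_1$ and $\hat{w}_2$. Then

$$\mathrm{span}\, s \ge \left(3 + c - \frac{28}{R}\right) \mathrm{len}\, s - 16L \cdot \mathrm{LCS}(w_1, w_2) - \frac{40L}{R}.$$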

We are now ready to prove the alphabet reduction claim (Theorem 3) via concatenation with the dirty 
construction at the inner level. 
Proof of Theorem 3 (for binary case). All that remains to be done is to pick parameters suitably. We set $R$
to the smallest number greater than $56/\gamma$ such that it can be written in the form $(1+c)t$ for an integer $t$
and $c \in [\sqrt{2} - 1 - \gamma/2, \sqrt{2} - 1]$, and we use this value of $c$. It is not difficult to see that this is possible with
$R \in O(1/\gamma)$. Define $T = 2L$ (recall that $L = R^{2K+1}$) and $\tau: [K] \to [2]^T$ as $\tau(l) = g_l$ (as defined in (8)), and let
$C_2 \subseteq [2]^N$, where $N = 2nL$, be the code obtained as in the statement of Theorem 3.

By Lemma 11, we can conclude that any common subsequence $s$ of two distinct codewords of $C_2$ satisfies

$$\mathrm{span}\, s \ge (2 + \sqrt{2} - \gamma)\, \mathrm{len}\, s - 8\gamma N - \gamma N.$$

Since $\mathrm{len}\, s \le N$, the right hand side is at least $(2 + \sqrt{2})\, \mathrm{len}\, s - 10\gamma N$, as claimed in (1). □

Remark 2. For the level of dirt discussed here, i.e., $c \le \sqrt{2} - 1$, the analysis is optimal for the same reason 
as the clean one is optimal, as the analysis shows that the dirt is dropped in forming the subsequence. Indeed, 
in the clean construction the efficient LCS of length $t$ spans $2t$ symbols in the high frequency string and $t$ 
symbols in the low frequency string. Introducing dirt increases the second number to $t(1+c)$ for a total 
span of $(3+c)t$. If the value of $c$ is larger, then the efficient LCS is obtained by using all symbols, including 
the dirt, in the low frequency (high amplitude) string. In the high frequency string it spans around 

$$\tfrac{1}{2}\big((1+c) + (1+c)/c\big)\,t$$

symbols (half of the time we are taking the most common symbol, moving at speed $(1+c)$, and half the time 
the other symbol, moving at speed $(1+c)/c$). Thus in this case the total span is about $t + (1+c)(1+1/c)t/2 = 
(2 + (c + 1/c)/2)t$, and the threshold of $\sqrt{2} - 1$ for $c$ was chosen to maximize $\min(3 + c,\ 2 + (c + 1/c)/2)$. 
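
The choice of threshold in Remark 2 can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function name `span_factor` and the grid resolution are illustrative assumptions. It evaluates the two span-to-length ratios from the remark and locates the value of $c$ that maximizes their minimum.

```python
import math

def span_factor(c: float) -> float:
    """Smaller of the two span-to-length ratios discussed in Remark 2."""
    drop_dirt = 3 + c                 # the subsequence discards the dirt
    use_dirt = 2 + (c + 1 / c) / 2    # the subsequence uses every low-frequency symbol
    return min(drop_dirt, use_dirt)

# Coarse grid search over c in (0, 1]; the resolution is an arbitrary choice.
grid = [i / 10000 for i in range(1, 10001)]
best_c = max(grid, key=span_factor)

print(best_c, math.sqrt(2) - 1)        # the two values should nearly agree
print(span_factor(math.sqrt(2) - 1))   # should be close to 2 + sqrt(2)
```

The first ratio increases with $c$ and the second decreases for $c < 1$, so the maximum of the minimum sits at their crossing point, which solves $c^2 + 2c - 1 = 0$.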


3.2.2 Dirty construction, general case 


Let us give the highlights of the general construction for alphabet size $k$. In this case we define “$M$ dirty 
ones at frequency $a$” to be the string 

$$\big(1^{a}\, 2^{ca}\, 3^{ca} \cdots k^{ca}\big)^{M / ((1 + (k-1)c)a)}$$

where we assume that $c$ is a positive number bounded from above by $(\sqrt{k} - 1)/(k - 1)$. We denote this string 
by I^ ^ and we have analogous dirty strings of other symbols. 

The extension of Lemma 8 is as follows. 


Lemma 12. For j G [k], letw\ be a string of the form ^ and let sbe a subsequence ofw 2 
then, 


span„,| s + kbi^ {k+I+ {k 


l)c)len 5 ' 


k^abi 

bi 


(-tk 2 ^ 3 ^ 

\'^bi,b2^bubibi,b2 


Kk 

■■%,b2 


The proof of this lemma follows along the lines of Lemma 8 with some obvious modifications. If we let 
S be the number of occurrences of /s in s and S the total number of other symbols we get a lower bound for 
the span of the form 

(1 + (k - l)c)5+ (1 + (k - l)c)S/c - — 

bi 

By the upper bound on c we have 


$$(1 + (k-1)c)/c \ \ge\ k + 1 + (k-1)c,$$
and we can again focus on 5 = 0 where again S establishes the lemma. The lemma establishes that the 
span of subsequences of coding of unequal symbols is large, and adapting the rest of the proof to establish 
Theorem 3 for general $k$ is straightforward and we omit the details. 
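
To see where the factor $k + \sqrt{k}$ comes from, one can evaluate both sides of the inequality $(1 + (k-1)c)/c \ge k + 1 + (k-1)c$ at the largest allowed dirt level $c = (\sqrt{k} - 1)/(k - 1)$. The sketch below is only an illustration; the helper names are assumptions and do not appear in the text.

```python
import math

def dirt_threshold(k: int) -> float:
    """Upper bound on c assumed in the text: (sqrt(k) - 1) / (k - 1)."""
    return (math.sqrt(k) - 1) / (k - 1)

def speed_other(k: int, c: float) -> float:
    """Left-hand side (1 + (k-1)c) / c of the displayed inequality."""
    return (1 + (k - 1) * c) / c

def target_factor(k: int, c: float) -> float:
    """Right-hand side k + 1 + (k-1)c of the displayed inequality."""
    return k + 1 + (k - 1) * c

for k in range(2, 8):
    c_max = dirt_threshold(k)
    # At the threshold both sides coincide and equal k + sqrt(k).
    print(k, speed_other(k, c_max), target_factor(k, c_max), k + math.sqrt(k))
    # Strictly below the threshold the inequality is strict.
    assert speed_other(k, c_max / 2) > target_factor(k, c_max / 2)
```

At the threshold, $(k-1)c = \sqrt{k} - 1$, so the right-hand side is $k + \sqrt{k}$ and the left-hand side is $\sqrt{k}(\sqrt{k} + 1)$, which is the same number.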


4 Existence and construction of good deletion codes 


In this section, we will plug in good “outer” deletion codes over large alphabets into Theorem 3 to derive 
codes over alphabet $[k]$ that correct a fraction $\approx 1 - \frac{2}{k+\sqrt{k}}$ of deletions. 


4.1 Existential claims 


We start with “outer” codes over large alphabets guaranteed to exist by the probabilistic method. We use $h(\cdot)$ 
to denote the binary entropy function. A similar statement to the random coding argument below appears in 
[6], but we include the short proof for completeness. 

Lemma 13. Suppose $\gamma, r > 0$ and integer $K \ge 2$ satisfy 

$$2r \log K + 2h(\gamma) - \gamma \log K < 0.$$

Then, for all large $n$, there exists a code with $K^{rn}$ codewords in $[K]^n$ such that $\mathrm{LCS}(w, w') < \gamma n$ for all distinct 
$w, w'$ in the code. 


Proof. Let $w_1, \ldots, w_{K^{rn}}$ be a sequence of words sampled from $[K]^n$ uniformly at random without 
replacement. For any $i < j$ the joint distribution of $(w_i, w_j)$ is the same as that of two words independently sampled 
from $[K]^n$ conditioned on them being distinct. Hence, by the union bound we have 

$$\Pr[\mathrm{LCS}(w_i, w_j) \ge \gamma n] \le \binom{n}{\gamma n}^2 K^{-\gamma n}.$$

By the second application of the union bound we thus have 

$$\Pr[\exists\, i < j:\ \mathrm{LCS}(w_i, w_j) \ge \gamma n] \le K^{2rn} \binom{n}{\gamma n}^2 K^{-\gamma n} = 2^{n(2r \log K + 2h(\gamma) - \gamma \log K) + o(n)} < 1$$

for sufficiently large $n$. As this probability is less than 1, there is a choice of $w_1, \ldots, w_{K^{rn}}$ such that pairwise 
LCS is less than $\gamma n$. □ 
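
The hypothesis of Lemma 13 is easy to evaluate for concrete parameters. The sketch below is illustrative only: it assumes logarithms to base 2 (matching the binary entropy function), and the function names and the restriction to alphabet sizes that are powers of two are choices made here, not taken from the text.

```python
import math

def binary_entropy(x: float) -> float:
    """Binary entropy h(x) in bits, with h(0) = h(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def lemma13_exponent(gamma: float, r: float, K: int) -> float:
    """Left-hand side 2 r log K + 2 h(gamma) - gamma log K (logs base 2)."""
    log_k = math.log2(K)
    return 2 * r * log_k + 2 * binary_entropy(gamma) - gamma * log_k

def smallest_power_of_two_alphabet(gamma: float, r: float, max_log: int = 64):
    """Smallest K = 2^e with e <= max_log making the exponent negative, else None."""
    for e in range(1, max_log + 1):
        if lemma13_exponent(gamma, r, 2 ** e) < 0:
            return 2 ** e
    return None

gamma = 0.25
r = gamma / 6   # same ratio r = gamma / 6 as in the proof of Theorem 14
K = smallest_power_of_two_alphabet(gamma, r)
print(K, lemma13_exponent(gamma, r, K))
```

The condition can only hold when $r < \gamma/2$, since otherwise the coefficient of $\log K$ is non-negative; for smaller $\gamma$ the required alphabet size grows polynomially in $1/\gamma$.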


Using the above existential bound in Theorem 3, we now deduce the following. 

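Theorem 14 (Existence of deletion codes). Fix an integer $k \ge 2$. Then for every real number $\varepsilon > 0$, there 
is $r = (\varepsilon/k)^{O(\varepsilon^{-3})}$ such that for infinitely many $N$ there is a code $C \subseteq [k]^N$ of rate at least $r$ and $\mathrm{LCS}(C) < \big(\frac{2}{k+\sqrt{k}} + \varepsilon\big) N$. 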

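Proof. We first apply Lemma 13 with $\gamma = \varepsilon/4$ and $r = \gamma/6 = \varepsilon/24$ to get a code $C_1 \subseteq [K]^n$ for $K \le O(1/\varepsilon^3)$ 
with $\mathrm{LCS}(C_1) \le \varepsilon n/4$ and $|C_1| \ge K^{rn}$. Now applying Theorem 3 to $C_1$ yields a code $C_2 \subseteq [k]^N$ with 
$\mathrm{LCS}(C_2) \le \big(\frac{2}{k+\sqrt{k}} + \varepsilon\big) N$. The rate of $C_2$ is at least $r/T \ge (\varepsilon/k)^{O(\varepsilon^{-3})}$ since $T \le (k/\varepsilon)^{O(\varepsilon^{-3})}$. □ 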

Remark 3. The exponent $O(1/\varepsilon^3)$ in the rate can be improved to $O(1/\varepsilon^a)$ for any $a > 2$. We made the 
concrete choice $a = 3$ for notational convenience. 
4.2 Efficient deterministic construction 


Theorem 14 already shows the existence of positive rate codes over the alphabet $[k]$ which are capable of 
correcting a deletion fraction approaching $1 - \frac{2}{k+\sqrt{k}}$, giving our main combinatorial result. We now turn to 
explicit constructions of such codes. Given Theorem 3, all that we need is an explicit code family capable 
of correcting a deletion fraction approaching 1 over constant-sized alphabets, which is guaranteed by the 
following theorem. 

Lemma 15 ([ 6 ], Thm 3.4). For every 7 > 0 there exists an integer K ^ 0 ( 1 / 7 ^) such that for infinitely many 
block lengths n, one can construct a code C C [K]" of rate f^(7^) andhCS{C) ^ yn in time n(log?i)P°*^(^/^). 
Further, the code C can be efficiently encoded and decoded from a fraction (1 — 7 ) of deletions in n- 
(logn)P°*5'('/7) time. 


Remark 4. The linear dependence on $n$ in the decoding time can be deduced using fast ($n \cdot \mathrm{poly}(\log n)$ time) 
unique decoding algorithms for Reed-Solomon codes. The bounds stated in [ 6 ] are (log?i)P°*^('/^) time. 


Using the efficiently constructible codes of Lemma 15 in place of random codes as outer codes, we can 
get the constructive analog of Theorem 14 with a similar proof. We also record the statement concerning 
the span of common subsequences of distinct codewords of our code (which is guaranteed by Theorem 3), 
as we will make use of this in the next section on efficiently decodable deletion codes. 


Theorem 16 (Constructive deletion codes). Fix an integer k'^2. Then for every real number e > 0, there is 
r={e/k)^^^ ) such that for infinitely many N, we can construct a code C C in time C?(A^(logA^)P°*y('/®)) 


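such that (i) $C$ has rate at least $r$ and (ii) $\mathrm{LCS}(C) < \big(\frac{2}{k+\sqrt{k}} + \varepsilon\big) N$; in fact if $s$ is a common subsequence of 
two distinct codewords $c, \tilde{c} \in C$, then $\operatorname{span} s \ge (k + \sqrt{k}) \operatorname{len} s - \varepsilon k N$. 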


5 Deletion codes with efficient decoding algorithms 


We have already shown how to efficiently construct codes over alphabet $[k]$ that are combinatorially capable 
of correcting a deletion fraction approaching $1 - \frac{2}{k+\sqrt{k}}$. However, it is not so clear how to efficiently decode 
the codes in Theorem 16 from deletions. To this end, we now give an alternate explicit construction by 
concatenating codes with large distance for the Hamming metric with good $k$-ary deletion codes as constructed 
in the previous section. As a side benefit, the construction time will be improved as we will need the codes 
from Theorem 16 for exponentially smaller block lengths. 


5.1 Concatenating Hamming metric codes with deletion codes 

We state our concatenation result abstractly below, and then instantiate with appropriate codes later for 
explicit constructions. Recall that the relative distance (in Hamming metric) of a code $C$ of block length 
$n$ equals the minimum value of $\Delta(c, \tilde{c})/n$ over all distinct codewords $c, \tilde{c} \in C$, where $\Delta(x, y)$ denotes the 
Hamming distance between two words of the same length. 

Lemma 17. Let $\eta, \theta \in (0, 1]$. Let $C_{\mathrm{out}} \subseteq [Q]^n$ be a code of relative distance in Hamming metric at least 
$(1 - \eta)$. Let $C_{\mathrm{in}} \subseteq [k]^m$ be a code with $nQ$ codewords, one for each $(i, \alpha) \in [n] \times [Q]$, such that for any two 
distinct codewords $c_1, c_2 \in C_{\mathrm{in}}$ and a common subsequence $s$ of $c_1, c_2$, we have $\operatorname{span} s \ge (k + \sqrt{k}) \operatorname{len} s - \theta k m$. 
Consider the code $C_{\mathrm{concat}} \subseteq [k]^N$ for $N = nm$ obtained as follows¹: There will be a codeword of $C_{\mathrm{concat}}$ for 
each codeword $c$ of $C_{\mathrm{out}}$, obtained by replacing its $i$'th symbol $c_i$ by the codeword of $C_{\mathrm{in}}$ corresponding to 
$(i, c_i)$. Then we have 

$$\mathrm{LCS}(C_{\mathrm{concat}}) \le \Big(\frac{2}{k+\sqrt{k}} + 2\theta + \eta\Big) N.$$

Proof. This proof is similar to, but simpler than, the proofs of Lemmas 6 and 7. It is simpler because in the 
present situation a codeword of $C_{\mathrm{in}}$ occurs at most once inside a codeword of $C_{\mathrm{concat}}$. 

Let $c, \tilde{c}$ be two distinct codewords of $C_{\mathrm{concat}}$ and let $\sigma$ be a common subsequence of $c, \tilde{c}$. Recall that 
each codeword of $C_{\mathrm{concat}}$ can be viewed as a sequence of $n$ (inner) blocks belonging to $[k]^m$, with the $i$'th 
block encoding (as per $C_{\mathrm{in}}$) the $i$'th symbol of the outer codeword. Let us break $\sigma$ into parts based on which 
of the $n$ blocks in $c, \tilde{c}$ its common symbols come from (in some canonical, say greedy, way of forming the 
subsequence $\sigma$ from $c, \tilde{c}$). Let $\sigma_{i,j}$ denote the portion of $\sigma$ formed by using symbols from the $i$'th block 
of $c$ and the $j$'th block of $\tilde{c}$. Let $E$ be the set of pairs $(i, j)$ for which $\sigma_{i,j}$ is not the empty word. If we 
were to draw words $c$ and $\tilde{c}$ as horizontal lines parallel to each other with the $n$ blocks marked as vertically 
aligned points on the lines, and draw the pairs in $E$ as edges between corresponding points, then they would 
be non-crossing. Therefore, $|E| \le 2n$. Also, by the construction, the only portions $\sigma_{i,j}$ that are formed out 
of the same codeword of $C_{\mathrm{in}}$ are those with $i = j$ and $c_i = \tilde{c}_i$. Thus there are at most $\eta n$ such portions, by 
the assumed relative distance of $C_{\mathrm{out}}$. Combining all this, we have 

$$\operatorname{span} \sigma \ \ge\ \sum_{(i,j) \in E} \operatorname{span} \sigma_{i,j}$$

$$\ge\ \sum_{(i,j) \in E} \big((k + \sqrt{k}) \operatorname{len} \sigma_{i,j} - \theta k m\big) - (k + \sqrt{k})(\eta n) m$$

$$\ge\ (k + \sqrt{k}) \operatorname{len} \sigma - 2\theta k n m - (k + \sqrt{k}) \eta n m.$$

Since $\operatorname{span} \sigma \le 2N$, we have $\operatorname{len} \sigma < \big(\frac{2}{k+\sqrt{k}} + 2\theta + \eta\big) N$, as desired. □ 
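
The position-indexed concatenation used in Lemma 17 can be sketched in a few lines. This is an illustration under stated assumptions: the inner code `TOY_INNER` below is a hypothetical four-word binary code chosen only to make the example run (it is not the inner code of the paper), positions are 0-indexed rather than 1-indexed, and the quadratic dynamic program for LCS is the textbook one.

```python
from typing import Dict, Sequence, Tuple

def concat_encode(outer_codeword: Sequence[int],
                  inner: Dict[Tuple[int, int], str]) -> str:
    """Replace the i'th outer symbol c_i by the inner codeword indexed by (i, c_i)."""
    return "".join(inner[(i, sym)] for i, sym in enumerate(outer_codeword))

def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence, via the standard O(|x||y|) recurrence."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

# Hypothetical toy inner code: one word of length m = 4 for each pair (position, symbol).
TOY_INNER = {(0, 0): "1112", (0, 1): "2221", (1, 0): "1222", (1, 1): "2111"}

c = concat_encode((0, 1), TOY_INNER)        # blocks for (0, 0) and (1, 1)
c_tilde = concat_encode((1, 0), TOY_INNER)  # blocks for (0, 1) and (1, 0)
print(c, c_tilde, lcs_length(c, c_tilde))
```

Because the inner codeword depends on the position as well as on the symbol, each inner codeword appears at most once inside a concatenated codeword, which is the property the proof above relies on.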

The construction. We now instantiate the above by concatenating Reed-Solomon codes with the codes 
from Theorem 16. Fix the desired alphabet size $k \ge 2$ and $\gamma > 0$. 

Let $\mathbb{F}_q$ be a large finite field, and let $\ell = \lceil \gamma q / 2 \rceil$ be an integer. Let $C_{\mathrm{out}}$ be the Reed-Solomon encoding code of block 
length $n = q$ that maps degree $< \ell$ polynomials $f \in \mathbb{F}_q[X]$ to their evaluations at all points in $\mathbb{F}_q$. Note that 
its relative distance is $(q - \ell + 1)/q \ge 1 - \gamma/2$. 

Let $C_{\mathrm{in}}$ be a $k$-ary code with at least $q^2$ codewords constructed in Theorem 16 for $\varepsilon = \gamma/4$. By the 
promised rate of that construction, the block length of $C_{\mathrm{in}}$ can be taken to be $m = O_{k,\gamma}(\log q)$. Our 
final construction will apply Lemma 17 to $C_{\mathrm{out}}$ and $C_{\mathrm{in}}$ with parameters $\eta = \gamma/2$ and $\theta = \gamma/4$, to get a code 
$C_{\mathrm{concat}} \subseteq [k]^N$ for $N = qm$ with $\mathrm{LCS}(C_{\mathrm{concat}}) \le \big(\frac{2}{k+\sqrt{k}} + \gamma\big) N$. 

Let us now estimate the construction time. As a function of $N$, $m \le O_{k,\gamma}(\log N)$, and therefore the 
construction time for $C_{\mathrm{in}}$ becomes $O_{k,\gamma}(\log N (\log\log N)^{\mathrm{poly}(1/\gamma)})$. Together with the $q (\log q)^2$ time to construct 
a representation of $\mathbb{F}_q$ and the Reed-Solomon code, we get an overall construction time of $O(N \log^2 N)$ for 
large enough $N$. We record this in the following statement. 

¹Note that this is a concatenation of a “position-indexed version” of $C_{\mathrm{out}}$ with $C_{\mathrm{in}}$. 

Theorem 18 (Reed-Solomon + inner deletion codes with better construction time). Fix an integer $k \ge 2$. 
Then for every real number 7 > 0, there is r{k, 7 ) = {j/k)^^'^ ) such that for infinitely many and sufficiently 
large N, we can construct a code C C [Z:]^ in time O(A^log^A^) such that 

(i) $C$ has rate at least $r(k, \gamma)$, and 

(ii) $\mathrm{LCS}(C) < \big(\frac{2}{k+\sqrt{k}} + \gamma\big) N$. 


5.2 Deletion correction algorithm 


We now describe an efficient decoding procedure for the codes from Theorem 18. The procedure will 
succeed as long as the fraction of deletions is only slightly smaller than $1 - \frac{2}{k+\sqrt{k}}$. We describe the basic idea 
before giving the formal statement and proof. If we are given a subsequence $s$ of length $\big(\frac{2}{k+\sqrt{k}} + \delta\big) N$ of some 
codeword, then by a simple counting argument, there must be at least $\delta q/2$ inner blocks (corresponding to 
the inner encodings of the $q$ indexed Reed-Solomon symbols) in which $s$ contains at least $\big(\frac{2}{k+\sqrt{k}} + \frac{\delta}{2}\big) m$ 
symbols from the corresponding inner codeword. So we could decode the corresponding Reed-Solomon 
symbol (by brute-force) if we knew the boundaries of this block. Since we do not know this, the idea is 
to try decoding all contiguous chunks of size $\big(\frac{2}{k+\sqrt{k}} + \frac{\delta}{4}\big) m$ in $s$ with sufficient granularity (for example, 
subsequences beginning at locations which are multiples of $\delta m/4$). 

This might result in the decoding of several spurious symbols, but there will be enough correct symbols 
to list decode the Reed-Solomon code and produce a short list that includes the correct message. By the 
combinatorial guarantee on the LCS value of the concatenated code from Theorem 18, only the correct 
message will have an encoding containing $s$ as a subsequence. Therefore, we can prune the list and identify 
the correct message by re-encoding each candidate message and checking which one has $s$ as a subsequence. 
The list decoding step is similar to the one used in [6] for list decoding binary codes from a fraction of 
deletions approaching 1/2. Since we have the combinatorial guarantee that the code can correct a deletion 
fraction $\approx 1 - \frac{2}{k+\sqrt{k}}$, a list decoding algorithm up to this radius is also automatically a unique decoding 
algorithm. 


Theorem 19 (Explicit and efficiently decodable deletion codes). The concatenated code $C \subseteq [k]^N$ 
constructed in Theorem 18 can be efficiently decoded from a fraction $\big(1 - \frac{2}{k+\sqrt{k}} - O(\gamma^{1/3})\big)$ of worst-case 
deletions in $N^3 \cdot \mathrm{poly}(\log N)$ time, for large enough $N$. 


Proof. With hindsight, let $\delta = 3\gamma^{1/3}$. Suppose we are given a subsequence $s$ of an unknown codeword 
$c \in C$ (encoding the unknown polynomial $f$ of degree $< \ell$), where $\operatorname{len} s \ge \big(\frac{2}{k+\sqrt{k}} + \delta\big) N$. We claim that the 
following decoding algorithm recovers $c$. 


1 . 

2. [Inner decodings] For each integer j, 0 ^ 7 ^ (gm)/ 4 ’ following: 

(a) Let Oj be the contiguous subsequence of s of length + 7 )'” starting at position j -|- 1. 

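(b) By a brute-force search over $\mathbb{F}_q \times \mathbb{F}_q$, find the unique pair $(\alpha, \beta)$, if any, such that its encoding 
under $C_{\mathrm{in}}$ has $\sigma_j$ as a subsequence, and add $(\alpha, \beta)$ to $\mathcal{T}$. (This pair, if it exists, is unique since 
$\mathrm{LCS}(C_{\mathrm{in}}) < \big(\frac{2}{k+\sqrt{k}} + \frac{\gamma}{4}\big) m$, and $\delta \ge \gamma$.) 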
3. [Reed-Solomon list recovery] Find the list, call it $\mathcal{L}$, of all polynomials $p \in \mathbb{F}_q[X]$ of degree $< \ell$ 
such that 

$$\big|\{(\alpha, p(\alpha)) \mid \alpha \in \mathbb{F}_q\} \cap \mathcal{T}\big| \ \ge\ \frac{\delta q}{2}. \qquad (9)$$

4. [Pruning] Find the unique polynomial $f \in \mathcal{L}$, if any, such that its encoding under $C$ contains $s$ as a 
subsequence, and output $f$. 
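
Steps 2 and 4 of the algorithm can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `inner_decodings` and `prune`, the parameters `window` and `stride`, and the toy inner code are all assumptions made here, and the Reed-Solomon list recovery of Step 3 is not implemented (the candidate list is simply passed in).

```python
from typing import Callable, Dict, Iterable, Optional, Set, Tuple

def is_subsequence(s: str, w: str) -> bool:
    """True if s can be obtained from w by deleting symbols."""
    it = iter(w)
    return all(ch in it for ch in s)  # each `in` consumes w up to the match

def inner_decodings(s: str, inner: Dict[Tuple[int, int], str],
                    window: int, stride: int) -> Set[Tuple[int, int]]:
    """Step 2: brute-force decode contiguous chunks of s, keeping unique matches."""
    found: Set[Tuple[int, int]] = set()
    for start in range(0, max(len(s) - window, 0) + 1, stride):
        chunk = s[start:start + window]
        matches = [key for key, cw in inner.items() if is_subsequence(chunk, cw)]
        if len(matches) == 1:
            found.add(matches[0])
    return found

def prune(candidates: Iterable[Tuple[int, ...]],
          encode: Callable[[Tuple[int, ...]], str], s: str) -> Optional[Tuple[int, ...]]:
    """Step 4: keep the unique candidate whose encoding contains s, if there is one."""
    hits = [msg for msg in candidates if is_subsequence(s, encode(msg))]
    return hits[0] if len(hits) == 1 else None

# Hypothetical toy inner code (not the paper's): one word per (position, symbol).
TOY_INNER = {(0, 0): "1112", (0, 1): "2221", (1, 0): "1222", (1, 1): "2111"}

def toy_encode(outer_codeword: Tuple[int, ...]) -> str:
    return "".join(TOY_INNER[(i, sym)] for i, sym in enumerate(outer_codeword))

received = "1112111"  # toy_encode((0, 1)) == "11122111" with one symbol deleted
print(inner_decodings(received, TOY_INNER, window=4, stride=1))
print(prune([(0, 0), (0, 1), (1, 0)], toy_encode, received))
```

In the actual algorithm the window length and stride are fixed fractions of the inner block length $m$, and the pairs collected in Step 2 feed the list recovery of Step 3 rather than being used directly.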


Correctness. Break the codeword $c \in [k]^{nm}$ of the concatenated code $C$ into $n$ (inner) blocks, with the 
$i$'th block $b_i \in [k]^m$ corresponding to the inner encoding of the $i$'th symbol $(\alpha_i, f(\alpha_i))$ of the outer 
Reed-Solomon codeword. For some fixed canonical way of forming $s$ out of $c$, denote by $s_i$ the portion of $s$ 
consisting of the symbols in the $i$'th block $b_i$. Call an index $i \in [n]$ good if $\operatorname{len} s_i \ge \big(\frac{2}{k+\sqrt{k}} + \frac{\delta}{2}\big) m$. By a 
simple counting argument, there are at least $\delta n/2$ values of $i \in [n]$ that are good. 

For each good index $i \in [n]$, one of the inner decodings in Step 2 will attempt to decode a subsequence 
of $s_i$, and therefore will find the pair $(\alpha_i, f(\alpha_i))$. Since there are at least $\delta q/2$ good indices, the condition (9) is 
met for the correct $f$. Using Sudan's list decoding algorithm for Reed-Solomon codes [12], one can find the 
list of all degree $\le \ell$ polynomials $p \in \mathbb{F}_q[X]$ such that $(\alpha, p(\alpha)) \in \mathcal{T}$ for more than $\sqrt{2\ell|\mathcal{T}|}$ field elements 
$\alpha \in \mathbb{F}_q$. Further, this list will have at most $\sqrt{2|\mathcal{T}|/\ell}$ polynomials. 

Since $|\mathcal{T}| \le 4q/\delta$, if we pick $\delta$ so that $\delta q/2 > \sqrt{8\ell q/\delta}$, the decoding will succeed. Recalling that 
$\ell = \lceil \gamma q/2 \rceil$, this condition is met for our choice of $\delta$. 
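
The choice $\delta = 3\gamma^{1/3}$ can be sanity-checked with a short calculation. The derivation below is a sketch under assumptions that are not fully legible in the source: an agreement threshold of the form $\sqrt{2\ell|\mathcal{T}|}$ for the list decoder, the bound $|\mathcal{T}| \le 4q/\delta$, and $\ell \approx \gamma q/2$ (ignoring rounding).

```latex
\begin{align*}
\sqrt{2\ell\,|\mathcal{T}|}
  \;\le\; \sqrt{2 \cdot \frac{\gamma q}{2} \cdot \frac{4q}{\delta}}
  \;=\; 2q\sqrt{\frac{\gamma}{\delta}},
\qquad\text{so}\qquad
\frac{\delta q}{2} > 2q\sqrt{\frac{\gamma}{\delta}}
  \iff \delta^{3} > 16\gamma
  \iff \delta > (16\gamma)^{1/3} \approx 2.52\,\gamma^{1/3}.
\end{align*}
```

Under these assumptions any $\delta$ above roughly $2.52\,\gamma^{1/3}$ suffices, and $\delta = 3\gamma^{1/3}$ gives $\delta^3 = 27\gamma > 16\gamma$ with room to absorb the rounding in $\ell$.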

Runtime. The number of inner decodings performed is $O(q/\delta) = O(N)$, and each inner decoding takes 
$q^2 (\log q)^{O(1)} \le N^2 (\log N)^{O(1)}$ time. The set $\mathcal{T}$ has size at most $O(q/\delta) \le O(N)$ for $N$ large enough. The 
Reed-Solomon list decoding algorithm on $|\mathcal{T}|$ many points can be performed in $O(N^2)$ field operations, see 
for instance [10]. So the overall running time of the decoder is at most $N^3 \cdot \mathrm{poly}(\log N)$. □ 

Remark 5. The cubic runtime in the above construction arose because of the brute-force implementation of 
the inner decodings. One can recursively use the above concatenated codes themselves as the inner codes, in 
place of the codes from Theorem 16. Each of the inner decodings can now be performed in $\mathrm{poly}(\log N)$ time, 
for a total time of $N \cdot \mathrm{poly}(\log N)$ for Step 2. By using near-linear time implementations of Reed-Solomon 
list decoding [1], one can also perform Step 3 in $q \cdot \mathrm{poly}(\log q)$ time. Thus one can improve the decoding 
complexity to $N \cdot \mathrm{poly}(\log N)$. 


6 Concluding remarks 

The obvious question left open by this work is to determine the exact value of $p^*(k)$, the (supremum of the) 
largest fraction of deletions one can correct over alphabet size $k$ with positive rate. Even in the binary case 
we do not dare to have a strong opinion whether the value is $1/2$, $\sqrt{2} - 1$ or some intermediate value, but let 
us close with a few comments. 

When comparing the encodings of two different symbols in our inner code, one codeword looks locally 
like $1^A 2^A$ (or the other way around) where the other codeword has long stretches (of length $\gg A$) of the 
same symbol (which are equally often 1's and 2's). It is tempting to introduce one more level of granularity, 
let us call it “micro particles” in these long stretches, in the form of sequences of the form for j E {1,2} 
and $B$ smaller than $A$. We were unable to use this to improve the bounds of the construction. It seems like only 
the shortest period in each of the two codewords matters, but we do not have a formal statement to support 
this feeling. 
There are two reasons for subsequences having big spans in our construction. The first reason is that 
the frequencies are different (this is the main mechanism in the clean construction and hence in [2]) and the 
second is the impurities in the form of dirt. The span is large because we discard half of the high frequency 
string and all of the dirt. If the span is to approach 4 times the length of the subsequences, we need the 
fraction of dirt to approach half the length of the string, but this seems hard to combine with the intuition 
of being “dirt,” which should be in the minority. We suspect that some new mechanism is needed to prove that 
$p^*(2) = 1/2$, if this is indeed the true answer. 